Abstract. Real-time object detection is one of the core problems in computer vision.
The cascade boosting framework proposed by Viola and Jones has become the standard for
this problem. In this framework, the learning goal for each node is asymmetric: each node
must achieve a high detection rate with only a moderate false positive rate. We develop
new boosting algorithms to address this asymmetric learning problem. We show that our
methods explicitly optimize asymmetric loss objectives in a totally corrective fashion.
The methods are totally corrective in the sense that the coefficients of all selected weak
classifiers are updated at each iteration. In contrast, conventional boosting such as
AdaBoost is stage-wise, in that only the current weak classifier's coefficient is updated.
At the heart of totally corrective boosting is the column generation technique.
Experiments on face detection show that our methods outperform state-of-the-art
asymmetric boosting methods.

1 Introduction 

Due to its important applications in video surveillance, interactive human-machine
interfaces, etc., real-time object detection has attracted extensive research recently
[1-6]. Although it was introduced a decade ago, the boosted cascade classifier framework
of Viola and Jones [2] is still considered the most promising approach for object
detection, and it is the basis that many subsequent papers have extended.

One difficulty in object detection is that the problem is highly asymmetric. A common
method to detect objects in an image is to exhaustively search all sub-windows at all
possible scales and positions in the image, and use a trained model to detect target
objects. Typically, there are only a few targets in millions of searched sub-windows.
The cascade classifier framework partially solves the asymmetry problem by splitting
the detection process into several nodes. Only those sub-windows passing through all
nodes are classified as true targets. At each node, we want to train a classifier with a
very high detection rate (e.g., 99.5%) and a moderate false positive rate (e.g., around
50%). The learning goal of each node should be asymmetric in order to achieve optimal
detection performance. A drawback of standard boosting such as AdaBoost in the context
of the cascade framework is that it is designed to minimize the overall false rate. The
losses for misclassifying a positive example and a negative example are equal, so it
cannot build an optimal classifier for the asymmetric learning goal.
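To illustrate why the per-node targets are so lopsided, the overall rates of a cascade can be approximated by multiplying the per-node rates. The sketch below assumes that nodes behave independently, which is a simplification of ours rather than a claim of the text; the depth of 24 nodes is the one used in the face detection experiments, and the function name is hypothetical.

```python
def cascade_rates(node_detection_rate, node_false_positive_rate, n_nodes):
    """Overall (detection rate, false positive rate) of an n-node cascade,
    assuming independent nodes so that the per-node rates simply multiply."""
    overall_dr = node_detection_rate ** n_nodes
    overall_fpr = node_false_positive_rate ** n_nodes
    return overall_dr, overall_fpr


# Per-node goals quoted in the text: 99.5% detection, 50% false positives.
dr, fpr = cascade_rates(0.995, 0.5, 24)
print(f"overall detection rate = {dr:.3f}, overall false positive rate = {fpr:.2e}")
```

Under this assumption, a small loss in per-node detection rate compounds over the cascade, while even a 50% per-node false positive rate shrinks to a negligible overall rate.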

Many subsequent works attempt to improve the performance of object detectors
by introducing asymmetric loss functions into boosting algorithms. Viola and Jones
proposed asymmetric AdaBoost [3], which applies an asymmetric multiplier to one of the
classes. However, this asymmetry is absorbed immediately by the first weak classifier
because AdaBoost's optimization strategy is greedy. In practice, they manually apply
the n-th root of the multiplier at each iteration to keep the asymmetric effect
throughout the entire training process, where n is the number of weak classifiers. This
heuristic cannot guarantee that the solution is optimal, and the number of weak
classifiers needs to be specified before training. AdaCost, presented by Fan et al. [7],
adds a cost adjustment function to the weight updating strategy of AdaBoost. They also
pointed out that the weight updating rule should consider the cost not only in the
initial weights but also at each iteration. Li and Zhang [8] proposed FloatBoost to
reduce the redundancy of greedy search by incorporating floating search into AdaBoost.
In FloatBoost, poor weak classifiers are deleted when a new weak classifier is added.
Xiao et al. [9] improved the backtracking technique in [8] and exploited the historical
information of preceding nodes in the learning of successive nodes. Hou et al. [10] used
varying asymmetric factors for training different weak classifiers. However, because the
asymmetric factor changes during training, the loss function being optimized remains
unclear. Pham et al. [11] presented a method that trains asymmetric AdaBoost [3]
classifiers under a new cascade structure, namely the multi-exit cascade. Like the soft
cascade [12], boosting chain [9] and dynamic cascade [13], the multi-exit cascade is a
cascade structure that takes historical information into consideration. In the
multi-exit cascade, the n-th node "inherits" the weak classifiers selected at the
preceding n - 1 nodes. Wu et al. [14] stated that feature selection and ensemble
classifier learning can be decoupled. They designed a linear asymmetric classifier (LAC)
to adjust the linear coefficients of the selected weak classifiers. Kullback-Leibler
Boosting [15] iteratively learns robust linear features by maximizing the
Kullback-Leibler divergence.

Much of the previous work is based on AdaBoost and achieves the asymmetric 
learning goal by heuristic weight manipulations or post-processing techniques. It is
not trivial to assess how these heuristics affect the original loss function of AdaBoost. 
In this work, we construct new boosting algorithms directly from asymmetric losses. 
The optimization process is implemented by column generation. Experiments on toy 
data and real data show that our algorithms indeed achieve the asymmetric learning 
goal without any heuristic manipulation, and can outperform previous methods. 

Therefore, the main contributions of this work are as follows. 

1. We utilize a general and systematic framework (column generation) to construct
new asymmetric boosting algorithms, which can be applied to a variety of asymmetric
losses. There is no heuristic strategy in our algorithms that may cause
suboptimal solutions. In contrast, the globally optimal solution is guaranteed for our
algorithms.

Unlike Viola-Jones' asymmetric AdaBoost [3], the asymmetric effect of our methods
spreads over the entire training process. The coefficients of all weak classifiers
are updated at each iteration, which prevents the first weak classifier from absorbing
the asymmetry. The number of weak classifiers does not need to be specified
before training.

2. The asymmetric totally-corrective boosting algorithms introduce the asymmetric 
learning goal into both feature selection and ensemble classifier learning. Both the 
example weights and the linear classifier coefficients are learned in an asymmetric 
way. 

3. In practice, L-BFGS-B [16] is used to solve the primal problem, which is much
faster than solving the dual problem and requires less memory.

4. We demonstrate that with the totally corrective optimization, the linear coefficients 
of some weak classifiers are set to zero by the algorithm such that fewer weak 
classifiers are needed. We present analysis on the theoretical condition and show 
how useful the historical information is for the training of successive nodes. 



2 Asymmetric losses 

In this section, we propose two asymmetric losses, which are motivated by asymmetric 
AdaBoost [3] and cost-sensitive LogitBoost [17], respectively. 
We first introduce an asymmetric cost in the following form: 

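$$\mathrm{ACost} = \begin{cases} C_1 & \text{if } y = +1 \text{ and } \mathrm{sign}(F(x)) = -1, \\ C_2 & \text{if } y = -1 \text{ and } \mathrm{sign}(F(x)) = +1, \\ 0 & \text{if } y = \mathrm{sign}(F(x)). \end{cases}$$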

Here $x$ is the input data, $y$ is the label and $F(x)$ is the learned classifier. Viola
and Jones [3] directly take the product of ACost and the exponential loss
$E_{X,Y}[\exp(-yF(x))]$ as the asymmetric loss:

$$E_{X,Y}\bigl[\bigl(I(y = 1)\,C_1 + I(y = -1)\,C_2\bigr)\exp(-yF(x))\bigr],$$

where $I(\cdot)$ is the indicator function. In a similar manner, we can also form an
asymmetric loss from the logistic loss $E_{X,Y}[\mathrm{logit}(yF(x))]$:

$$\mathrm{ALoss}_1 = E_{X,Y}\bigl[\bigl(I(y = 1)\,C_1 + I(y = -1)\,C_2\bigr)\,\mathrm{logit}(yF(x))\bigr], \tag{1}$$

where $\mathrm{logit}(x) = \log(1 + \exp(-x))$ is the logistic loss function.

Masnadi-Shirazi and Vasconcelos [17] proposed cost-sensitive boosting algorithms
which optimize different versions of cost-sensitive losses by means of gradient
descent. They proved that the optimal cost-sensitive predictor minimizes the expected
loss:

$$-E_{X,Y}\bigl[I(y = 1)\log(p_c(x)) + I(y = -1)\log(1 - p_c(x))\bigr],$$

where $p_c(x) = \dfrac{e^{\gamma F(x) + \eta}}{e^{\gamma F(x) + \eta} + e^{-\gamma F(x) - \eta}}$, with $\gamma = \dfrac{C_1 + C_2}{2}$, $\eta = \dfrac{1}{2}\log\dfrac{C_2}{C_1}$.

Fixing $\gamma$ to 1, the expected loss can be reformulated as

$$\mathrm{ALoss}_2 = E_{X,Y}\bigl[\mathrm{logit}(yF(x) + 2y\eta)\bigr]. \tag{2}$$
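As an illustration, the two asymmetric losses can be evaluated on a finite sample by replacing the expectation with an average. This is a minimal sketch under that assumption; the function names `aloss1` and `aloss2` are hypothetical, and $\eta$ is computed from the costs as defined above.

```python
import math


def logit(x):
    """Logistic loss log(1 + exp(-x)), arranged to avoid overflow for large |x|."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def aloss1(scores, labels, c1, c2):
    """Sample average of loss (1): the logistic loss weighted by C1 or C2."""
    total = 0.0
    for f, y in zip(scores, labels):
        cost = c1 if y == 1 else c2
        total += cost * logit(y * f)
    return total / len(labels)


def aloss2(scores, labels, c1, c2):
    """Sample average of loss (2): the margin is shifted by 2*y*eta."""
    eta = 0.5 * math.log(c2 / c1)
    return sum(logit(y * f + 2 * y * eta) for f, y in zip(scores, labels)) / len(labels)


# With equal costs both losses reduce to the ordinary logistic loss.
scores, labels = [0.3, -1.2, 2.0], [1, -1, 1]
print(aloss1(scores, labels, 1.0, 1.0), aloss2(scores, labels, 1.0, 1.0))
```

The first loss rescales the penalty per class, whereas the second shifts the decision threshold through $\eta$; with $C_1 = C_2$ the two coincide.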
3 Asymmetric totally-corrective boosting

In this section, we construct asymmetric totally-corrective boosting algorithms (termed
AsymBoostTC here) from the losses (1) and (2) discussed previously. In contrast to the
methods for constructing boosting-like algorithms in [17], [18] and [19], we use column
generation to design our totally corrective boosting algorithms, inspired by [20] and
[5].

Suppose there are $M$ training examples ($M_1$ positives and $M_2$ negatives), and the
examples are ordered according to their labels (positives first). The pool $\mathcal{H}$
contains $N$ available weak classifiers. The matrix $H \in \mathbb{Z}^{M \times N}$
contains the binary outputs of the weak classifiers in $\mathcal{H}$ for the training
examples, namely $H_{ij} = h_j(x_i)$. We aim to learn a linear combination
$F_w(\cdot) = \sum_{j=1}^{N} w_j h_j(\cdot)$. $C_1$ and $C_2$ are the costs for
misclassifying positives and negatives, respectively. We assign the asymmetric factor
$k = C_2/C_1$ and restrict $\gamma = (C_1 + C_2)/2$ to 1, thus $C_1$ and $C_2$ are fixed
for a given $k$.
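For a given $k$, the two costs follow directly from the two constraints above. The short derivation below makes this explicit; the value $k = 3$ is only an illustrative choice.

```latex
\begin{align*}
C_2 = k\,C_1, \quad \frac{C_1 + C_2}{2} = 1
\;&\Longrightarrow\; C_1 = \frac{2}{1 + k}, \qquad C_2 = \frac{2k}{1 + k}, \\
k = 3 \;&\Longrightarrow\; C_1 = 0.5, \qquad C_2 = 1.5 .
\end{align*}
```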

The problems of the two AsymBoostTC algorithms can be expressed as:

$$\min_{w, z}\; \sum_{i=1}^{M} l_i\,\mathrm{logit}(z_i) + \theta \mathbf{1}^\top w \quad \text{s.t.} \quad w \succeq 0,\; z_i = y_i H_i w, \tag{3}$$

where $l = [C_1/M_1, \cdots, C_2/M_2, \cdots]^\top$, and

$$\min_{w, z}\; \sum_{i=1}^{M} e_i\,\mathrm{logit}(z_i + 2 y_i \eta) + \theta \mathbf{1}^\top w \quad \text{s.t.} \quad w \succeq 0,\; z_i = y_i H_i w, \tag{4}$$

where $e = [1/M_1, \cdots, 1/M_2, \cdots]^\top$. In both (3) and (4), $z_i$ stands for the
margin of the $i$-th training example and $H_i$ for the $i$-th row of $H$. We refer to (3)
as AsymBoostTC1 and (4) as AsymBoostTC2. Note that here the optimization problems are
$\ell_1$-norm regularized. It is possible to use other forms of regularization such as
the $\ell_2$-norm.

First we introduce the fact that the Fenchel conjugate [21] of the logistic loss function
$\mathrm{logit}(x)$ is

$$\mathrm{logit}^*(u) = \begin{cases} (-u)\log(-u) + (1 + u)\log(1 + u), & 0 \ge u \ge -1; \\ \infty, & \text{otherwise.} \end{cases}$$
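The closed form can be checked by a short calculation. The derivation below is a sketch for the interior case $-1 < u < 0$ and uses only the definition $\mathrm{logit}(z) = \log(1 + e^{-z})$ from Section 2.

```latex
\begin{align*}
\mathrm{logit}^*(u) &= \sup_{z}\,\bigl\{ u z - \log(1 + e^{-z}) \bigr\}, \\
0 = u + \frac{e^{-z}}{1 + e^{-z}} \;&\Longrightarrow\; z^\star = \log\frac{1 + u}{-u}, \\
\mathrm{logit}^*(u) &= u \log\frac{1 + u}{-u} + \log(1 + u)
                     = (-u)\log(-u) + (1 + u)\log(1 + u).
\end{align*}
```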



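Now we derive the Lagrange dual [21] of AsymBoostTC1. The Lagrangian of (3) is

$$L(\underbrace{w, z}_{\text{primal}}, \underbrace{\lambda, u}_{\text{dual}}) = \sum_{i=1}^{M} l_i\,\mathrm{logit}(z_i) + \theta \mathbf{1}^\top w - \lambda^\top w + \sum_{i=1}^{M} u_i (z_i - y_i H_i w).$$

The dual function is

$$g(\lambda, u) = \inf_{w, z} L(w, z, \lambda, u) = -\sum_{i=1}^{M} \underbrace{\sup_{z_i}\bigl(-u_i z_i - l_i\,\mathrm{logit}(z_i)\bigr)}_{l_i\,\mathrm{logit}^*(-u_i/l_i)} + \inf_{w} \underbrace{\Bigl(\theta \mathbf{1}^\top - \lambda^\top - \sum_{i=1}^{M} u_i y_i H_i\Bigr)}_{\text{must be } 0} w.$$

The dual problem is

$$\max_{u}\; -\sum_{i=1}^{M} \bigl[u_i \log(u_i) + (l_i - u_i)\log(l_i - u_i)\bigr] \quad \text{s.t.} \quad \sum_{i=1}^{M} u_i y_i H_i \preceq \theta \mathbf{1}^\top,\; 0 \preceq u \preceq l. \tag{5}$$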



Since problem (3) is convex and Slater's condition is satisfied [21], the
duality gap between the primal (3) and the dual (5) is zero. Therefore, the optimal
values of (3) and (5) are the same. By the KKT conditions, the gradient of the Lagrangian
with respect to the primal variable $z$ and the dual variable $u$ should vanish at the
optimum. Therefore, we can obtain the relationship between the optimal values of $z$
and $u$:

$$u_i^* = \frac{l_i \exp(-z_i^*)}{1 + \exp(-z_i^*)}. \tag{6}$$
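The relation between margins and example weights for AsymBoostTC1 can be sketched in a few lines. This is an illustrative helper rather than the authors' code, and the function names are hypothetical. At zero margin it gives $u_i = l_i/2$, which is the initial weight used in Algorithm 1.

```python
import math


def sigmoid_neg(t):
    """exp(-t) / (1 + exp(-t)), arranged to avoid overflow for large |t|."""
    if t >= 0:
        e = math.exp(-t)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(t))


def example_weights_tc1(z, l):
    """Weights u_i = l_i * exp(-z_i) / (1 + exp(-z_i)) for margins z and costs l."""
    return [l_i * sigmoid_neg(z_i) for z_i, l_i in zip(z, l)]


# Zero margins give half of each cost; large positive margins give weights near zero.
print(example_weights_tc1([0.0, 0.0], [0.5, 1.5]))
print(example_weights_tc1([10.0, -10.0], [0.5, 1.5]))
```

Examples that are already classified with a large margin therefore receive almost no weight, while badly misclassified examples receive a weight close to their full cost $l_i$.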

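Similarly, we can get the dual problem of AsymBoostTC2, which is expressed as:

$$\max_{u}\; -\sum_{i=1}^{M} \bigl[u_i \log(u_i) + (e_i - u_i)\log(e_i - u_i) + 2 u_i y_i \eta\bigr] \quad \text{s.t.} \quad \sum_{i=1}^{M} u_i y_i H_i \preceq \theta \mathbf{1}^\top,\; 0 \preceq u \preceq e, \tag{7}$$

with

$$u_i^* = \frac{e_i \exp(-z_i^* - 2 y_i \eta)}{1 + \exp(-z_i^* - 2 y_i \eta)}. \tag{8}$$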

In practice, the total number of weak classifiers, $N$, could be extremely large, so we
cannot solve the primal problems (3) and (4) directly. Equivalently, however, we can
optimize the duals (5) and (7) iteratively using column generation [20]. In each round,
we add the most violated constraint by finding a weak classifier satisfying:

$$h^*(\cdot) = \arg\max_{h(\cdot)} \sum_{i=1}^{M} u_i y_i h(x_i). \tag{9}$$

This step is the same as training a weak classifier in AdaBoost and LPBoost, in which
one tries to find a weak classifier with the maximal edge (i.e., the minimal weighted
error). The edge of $h_j$ is defined as $\sum_{i=1}^{M} u_i y_i h_j(x_i)$, which is
inversely related to the weighted error. Then we solve the restricted dual problem with
one more constraint than in the previous round, and update the linear coefficients of
the weak classifiers ($w$) and the weights of the training examples ($u$). Adding one
constraint to the dual problem corresponds to adding one variable to the primal problem.
Since the primal problem and the dual problem are equivalent, we can solve either the
restricted dual or the restricted primal in practice. The algorithms of AsymBoostTC1 and
AsymBoostTC2 are summarized in Algorithm 1.

Algorithm 1 The training algorithms of AsymBoostTC1 and AsymBoostTC2.

Input: A training set with $M$ labeled examples ($M_1$ positives and $M_2$ negatives);
termination tolerance $\varepsilon > 0$; regularization parameter $\theta$; asymmetric
factor $k$; maximum number of weak classifiers $N_{\max}$.

1 Initialization: $N = 0$; $w = 0$; and $u_i = l_i/2$ or $e_i/(1 + k^{-y_i})$, $i = 1 \cdots M$.

2 for iteration $= 1 : N_{\max}$ do

3 - Train a weak classifier $h'(\cdot) = \arg\max_{h(\cdot) \in \mathcal{H}} \sum_{i=1}^{M} u_i y_i h(x_i)$.

4 - Check for the termination condition:
if iteration $> 1$ and $\sum_{i=1}^{M} u_i y_i h'(x_i) < \theta + \varepsilon$, then break;

5 - Increment the number of weak classifiers $N = N + 1$.

6 - Add $h'(\cdot)$ to the restricted master problem;

7 - Solve the primal problem (3) or (4) (or the dual problem (5) or (7)) and update
$u_i$ ($i = 1 \cdots M$) and $w_j$ ($j = 1 \cdots N$).

Output: The selected weak classifiers are $h_1, h_2, \ldots, h_N$. The final strong
classifier is: $F(x) = \sum_{j=1}^{N} w_j h_j(x)$.

Note that, in practice, in order to achieve a specific false negative rate (FNR)
or false positive rate (FPR), an offset $b$ needs to be added to the final strong
classifier: $F(x) = \sum_{j=1}^{N} w_j h_j(x) - b$, which can be obtained by a simple line
search. The new weak classifier $h'(\cdot)$ corresponds to an extra variable in the
primal and an extra constraint in the dual. Thus, the minimal value of the primal
decreases with growing variables, and the maximal value of the dual problem also
decreases with growing constraints. Furthermore, as the optimization problems involved
are convex, Algorithm 1 is guaranteed to converge to the global optimum.
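One way to pick the offset $b$ for a target detection rate is sketched below. It sorts the scores of validation positives instead of performing an explicit line search, and it assumes that a sample is accepted when $F(x) - b \ge 0$; both choices, as well as the function names, are assumptions made for illustration.

```python
import math


def offset_for_detection_rate(positive_scores, target_dr):
    """Largest offset b such that at least target_dr of the positives have F(x) - b >= 0."""
    ranked = sorted(positive_scores, reverse=True)
    # Small guard so that products such as 0.75 * 4 are not pushed up by rounding error.
    n_needed = max(1, math.ceil(target_dr * len(ranked) - 1e-9))
    return ranked[n_needed - 1]


def false_positive_rate(negative_scores, b):
    """Fraction of negatives that are still accepted after subtracting the offset b."""
    return sum(1 for s in negative_scores if s - b >= 0) / len(negative_scores)


# Hypothetical validation scores F(x) before the offset is applied.
positives = [0.9, 0.2, 0.5, 0.7]
negatives = [0.6, 0.1, 0.4, -0.3]
b = offset_for_detection_rate(positives, 0.75)
print(b, false_positive_rate(negatives, b))
```

Fixing the detection rate in this way and reading off the resulting false positive rate is how a node can be compared against its asymmetric learning goal.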

Next we show how AsymBoostTC introduces asymmetric learning into feature
selection and ensemble classifier learning. Decision stumps are the most commonly
used type of weak classifier, and each stump uses only one dimension of the features.
So the process of training weak classifiers (decision stumps) is equivalent to feature
selection. In our framework, the weak classifier with the maximum edge (i.e., the minimal
weighted error) is selected. From (6) and (8), the weight of the $i$-th example, namely
$u_i$, is affected by two factors: the asymmetric factor $k$ and the current margin
$z_i$. If we set $k = 1$, the weighting strategy goes back to being symmetric. On the
other hand, the coefficients of the linear classifier, namely $w$, are updated by solving
the restricted primal problem at each iteration. The asymmetric factor $k$ in the primal
is absorbed by all the weak classifiers currently learned. So feature selection and
ensemble classifier learning both consider the asymmetric factor $k$.

The number of variables of the primal problem is the number of weak classifiers,
while for the dual problem, it is the number of training examples. In the cascade
classifiers for face detection, the number of weak classifiers is usually much smaller
than the number of training examples, so solving the primal is much cheaper than solving
the dual. Since the primal problem has only simple bound constraints, we can
employ L-BFGS-B [16] to solve it. L-BFGS-B is a tool based on the quasi-Newton
method for bound-constrained optimization. Instead of maintaining the Hessian matrix,
L-BFGS-B only needs the most recent updates of the values and gradients of the cost
function to approximate the Hessian matrix. Thus, L-BFGS-B requires less memory when
running. In column generation, we can use the result of the previous iteration as the
starting point of the current problem, which leads to further reductions in computation
time.
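The following is a minimal sketch of Algorithm 1 for AsymBoostTC1 on a toy problem. It is not the authors' implementation: plain projected gradient descent stands in for L-BFGS-B, the weak classifier pool is given as a precomputed matrix `H` with entries +1 or -1, columns that were already selected are skipped when searching for the next one, and all function names, costs and data are hypothetical.

```python
import math


def sigmoid_neg(t):
    """exp(-t) / (1 + exp(-t)), arranged to avoid overflow for large |t|."""
    if t >= 0:
        e = math.exp(-t)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(t))


def margins(H_sel, y, w):
    """z_i = y_i * H_i w, using only the currently selected columns."""
    return [y_i * sum(h * w_j for h, w_j in zip(row, w)) for row, y_i in zip(H_sel, y)]


def solve_restricted_primal(H_sel, y, l, theta, w_start, steps=2000):
    """Projected gradient descent on (3) over the selected columns, keeping w >= 0.
    The step 1/N is a safe choice when H has +/-1 entries and sum(l) = 2."""
    w = list(w_start)
    step = 1.0 / len(w)
    for _ in range(steps):
        u = [l_i * sigmoid_neg(z_i) for l_i, z_i in zip(l, margins(H_sel, y, w))]
        # Gradient of the objective with respect to w_j: theta - sum_i u_i y_i H_ij.
        grad = [theta - sum(u_i * y_i * row[j] for u_i, y_i, row in zip(u, y, H_sel))
                for j in range(len(w))]
        w = [max(0.0, w_j - step * g) for w_j, g in zip(w, grad)]
    return w


def column_generation(H, y, l, theta, eps=1e-6, n_max=10):
    """Column generation loop in the spirit of Algorithm 1 (AsymBoostTC1 variant)."""
    n_examples, n_pool = len(H), len(H[0])
    u = [l_i / 2.0 for l_i in l]                 # initial weights u_i = l_i / 2
    selected, w = [], []
    for iteration in range(n_max):
        candidates = [j for j in range(n_pool) if j not in selected]
        if not candidates:
            break
        edges = {j: sum(u[i] * y[i] * H[i][j] for i in range(n_examples))
                 for j in candidates}
        best = max(candidates, key=lambda j: edges[j])
        if iteration > 0 and edges[best] < theta + eps:
            break                                # no remaining constraint is violated
        selected.append(best)
        H_sel = [[row[j] for j in selected] for row in H]
        # Warm start: reuse the previous coefficients and start the new one at zero.
        w = solve_restricted_primal(H_sel, y, l, theta, w + [0.0])
        u = [l_i * sigmoid_neg(z_i) for l_i, z_i in zip(l, margins(H_sel, y, w))]
    return selected, w


# Toy problem: 3 positives, 3 negatives, and a pool of 3 weak classifiers (columns).
y = [1, 1, 1, -1, -1, -1]
H = [[1, 1, 1], [1, 1, 1], [1, -1, 1], [-1, -1, 1], [-1, -1, 1], [-1, 1, 1]]
c1, c2 = 1.5, 0.5                                # hypothetical costs, (c1 + c2) / 2 = 1
l = [c1 / 3] * 3 + [c2 / 3] * 3
selected, w = column_generation(H, y, l, theta=0.1)
print(selected, w)
```

On this toy pool the first column reproduces the labels exactly, so the loop stops after selecting it: once its coefficient is optimized, the edges of the remaining columns fall below $\theta$.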

The complementary slackness condition [21] suggests that $\lambda_j w_j = 0$. So we can
get the condition for sparseness:

$$\text{If } \lambda_j = \theta - \sum_{i=1}^{M} u_i y_i H_{ij} > 0, \text{ then } w_j = 0. \tag{10}$$

This means that, if the weak classifier $h_j(\cdot)$ is so "weak" that its edge is less
than $\theta$ under the current distribution $u$, its contribution to the ensemble
classifier is "zero". From another viewpoint, the $\ell_1$-norm regularization term in
the primal problems (3) and (4) leads to a sparse result. The parameter $\theta$ controls
the degree of sparseness: the larger $\theta$ is, the sparser the result will be.
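The sparseness condition can be checked column by column, as in the sketch below. The weights `u` in the example are arbitrary illustrative values rather than the solution of an actual training run, and the function names are hypothetical.

```python
def sparsity_multipliers(H, y, u, theta):
    """lambda_j = theta - sum_i u_i y_i H_ij for every weak classifier (column) j."""
    n_cols = len(H[0])
    return [theta - sum(u_i * y_i * row[j] for u_i, y_i, row in zip(u, y, H))
            for j in range(n_cols)]


def forced_to_zero(H, y, u, theta, tol=1e-12):
    """Indices j with lambda_j > 0, whose coefficients w_j must be zero by (10)."""
    return [j for j, lam in enumerate(sparsity_multipliers(H, y, u, theta)) if lam > tol]


# Column 0 agrees with every label (edge 1.0); column 1 is right half the time (edge 0.0).
H = [[1, 1], [1, -1], [-1, -1], [-1, 1]]
y = [1, 1, -1, -1]
u = [0.25, 0.25, 0.25, 0.25]
print(sparsity_multipliers(H, y, u, theta=0.5))
print(forced_to_zero(H, y, u, theta=0.5))
```

A larger `theta` moves more multipliers above zero, which is the mechanism behind the sparser solutions described above.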

4 Experiments 

4.1 Results on synthetic data 

To show the behavior of our algorithms, we construct a 2D data set in which the
positive data follow the 2D normal distribution $\mathcal{N}(0, 0.1 I)$, and the negative
data form a ring with uniformly distributed angles and normally distributed radius
$\mathcal{N}(1.0, 0.2)$. In total, 2000 examples are generated (1000 positives and 1000
negatives); 50% of the data are used for training and the other half for testing. We
compare AdaBoost, AsymBoostTC1 and AsymBoostTC2 on this data set. All training processes
are stopped at 100 decision stumps. For AsymBoostTC1 and AsymBoostTC2, we fix $\theta$
to 0.01, and use a group of $k$'s {1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0}.
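A data set of this kind can be generated as sketched below. The text does not say whether the second parameter of each normal distribution is a variance or a standard deviation, nor how the split is drawn; the sketch assumes variances and a random half split, and the function name and seed are hypothetical.

```python
import math
import random


def make_ring_data(n_pos=1000, n_neg=1000, seed=0):
    """Gaussian blob of positives surrounded by a noisy ring of negatives."""
    rng = random.Random(seed)
    pos_std = math.sqrt(0.1)      # assumption: 0.1 is the per-axis variance
    radius_std = math.sqrt(0.2)   # assumption: 0.2 is the variance of the radius
    points = []
    for _ in range(n_pos):
        points.append(((rng.gauss(0.0, pos_std), rng.gauss(0.0, pos_std)), +1))
    for _ in range(n_neg):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        radius = rng.gauss(1.0, radius_std)
        points.append(((radius * math.cos(angle), radius * math.sin(angle)), -1))
    rng.shuffle(points)           # assumption: the 50/50 split is drawn at random
    half = len(points) // 2
    return points[:half], points[half:]


train, test = make_ring_data()
print(len(train), len(test))
```

Each element is a pair of a 2D point and its label, so the two halves can be passed directly to any of the compared boosting methods.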

From Figures 1 (1) and (2), we find that the larger $k$ is, the bigger the area for
positive output becomes, which means that the asymmetric LogitBoost tends to make
a positive decision for the region where positive and negative data are mixed together.
Another observation is that AsymBoostTC1 and AsymBoostTC2 have almost the same
decision boundaries on this data set for the same $k$'s.

Figures 1 (3) and (4) show the trends of the false rates as the asymmetric factor
$k$ grows. The results of AdaBoost are taken as the baseline. For all $k$'s,
AsymBoostTC1 and AsymBoostTC2 achieve lower false negative rates and higher false
positive rates than AdaBoost. As $k$ grows, AsymBoostTC1 and AsymBoostTC2
reduce the false negative rate more aggressively, at the cost of a higher
false positive rate.

4.2 Face detection 

We collect 9832 mirrored frontal face images and about 10115 large background images.
5000 face images and 7000 background images are used for training, and 4832
face images and 3115 background images for validation. Five basic types of Haar features
are calculated on each 24 × 24 image, generating 162336 features in total. Decision
stumps on those 162336 features form the pool of weak classifiers.
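The feature count can be reproduced by enumerating every position and integer scaling of each base shape inside the window. The sketch assumes that the five basic types are the usual two horizontal and vertical two-rectangle shapes, the two three-rectangle shapes and the four-rectangle shape; the function name is hypothetical.

```python
def count_haar_features(window=24,
                        base_shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    """Count upright Haar-like features in a window x window patch.
    Each base shape (width, height) is scaled by integer factors in both
    directions and placed at every position where it fits."""
    total = 0
    for base_w, base_h in base_shapes:
        for w in range(base_w, window + 1, base_w):
            for h in range(base_h, window + 1, base_h):
                total += (window - w + 1) * (window - h + 1)
    return total


n_features = count_haar_features()
print(n_features)
```

Under these assumptions the enumeration gives 43200 features for each two-rectangle type, 27600 for each three-rectangle type and 20736 for the four-rectangle type, which adds up to the 162336 features quoted above.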
Fig. 1: Results on the synthetic data for AsymBoostTC1 and AsymBoostTC2, with a
group of asymmetric factors $k$. As the baseline, the results for AdaBoost are also shown
in these figures. (1) and (2) show the decision boundaries learned by AsymBoostTC1
and AsymBoostTC2, with $k$ equal to 2.0 or 3.0. The ×'s and ○'s stand for training
negatives and training positives, respectively. (3) and (4) show false rates (FR),
false positive rates (FPR) and false negative rates (FNR) on the test set for a group of
$k$'s (1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8 or 3.0); the corresponding rates for
AdaBoost are shown as dashed lines.



Single-node detectors. Single-node classifiers with AdaBoost, AsymBoostTC1 and
AsymBoostTC2 are trained. The parameters $\theta$ and $k$ are simply set to 0.001 and 7.0.
5000 faces and 5000 non-faces are used for training, while 4832 faces and 5000
non-faces are used for testing. The training/validation non-faces are randomly cropped
from the training/validation background images.

Figure 2 (1) shows curves of the detection rate with the false positive rate fixed at
0.25, while curves of the false positive rate at a 0.995 detection rate are shown in
Figure 2 (2). We fix the false positive rate at 0.25 rather than the commonly used 0.5 in
order to slow down the increase of the detection rate; otherwise the detection rates
would converge to 1.0 immediately. The detection rate increases, and the false positive
rate decreases, faster than reported in [8] and [9]. The reason is possibly that we use
10000 examples for training and 9832 for testing, which are fewer than the data used
in [8] and [9] (18000 training examples and 15000 test examples). We can see that
under both settings, our algorithms achieve better performance than AdaBoost in
most cases.

The benefits of our algorithms are two-fold: (1) Given the same learning goal, our
algorithms tend to use a smaller number of weak classifiers. For example, from
Figure 2 (2), if we want a classifier with a 0.995 detection rate and a 0.2
false positive rate, AdaBoost needs at least 43 weak classifiers while AsymBoostTC1
needs 32 and AsymBoostTC2 needs only 22. (2) Using the same number of weak classifiers,
our algorithms achieve a higher detection rate or a lower false positive rate.
For example, from Figure 2 (1), using 30 weak classifiers, both AsymBoostTC1 and
AsymBoostTC2 achieve higher detection rates (0.9965 and 0.9975) than AdaBoost
(0.9945).

Complete detectors. Second, we train complete face detectors with AdaBoost,
asymmetric-AdaBoost, AsymBoostTC1 and AsymBoostTC2. All detectors are trained
using the same training set. We use two types of cascade framework for detector
training: the traditional cascade of Viola and Jones [2] and the multi-exit cascade
presented in [11]. The latter utilizes the decision information of previous nodes when
judging instances in the current node. For a fair comparison, all detectors use 24 nodes
and 3332 weak classifiers. For each node, 5000 faces + 5000 non-faces are used for
training, and 4832 faces + 5000 non-faces are used for validation. All non-faces are
cropped from background images. The asymmetric factor $k$ for asymmetric-AdaBoost,
AsymBoostTC1
and AsymBoostTC2 is selected from {1.2, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0}. The regularization
factor $\theta$ for AsymBoostTC1 and AsymBoostTC2 is chosen from
{1/50, 1/100, 1/200, 1/400, 1/800, 1/1000}. It takes about four hours to train an AsymBoostTC face
detector on a machine with 8 Intel Xeon E5520 cores and 32GB memory. Compared
with AdaBoost, only around 0.5 hours of extra time is spent on solving the primal problem
at each iteration. We can say that, in the context of face detection, the training time of
AsymBoostTC is nearly the same as that of AdaBoost.

ROC curves on the CMU/MIT data set are shown in Figure 3. Images containing
ambiguous faces are removed and 120 images are retained. From the figure, we
can see that asymmetric-AdaBoost outperforms AdaBoost in both the Viola-Jones cascade
and the multi-exit cascade, which coincides with what is reported in [3]. Our algorithms
perform better than all other methods at all points, and the improvements are more
significant when the number of false positives is less than 100, which is the most
commonly used region in practice.

As mentioned in the previous section, our algorithms produce sparse results to some
extent. Some linear coefficients are zero when the corresponding weak classifiers
satisfy condition (10). In the multi-exit cascade, the sparsity becomes more evident.
Since correctly classified negative data are discarded after each node is trained,
the training data for each node are different. Nodes that are "closer" share more
training examples, while nodes "far away" from each other have distinct training
data. The greater the distance between two nodes, the more uncorrelated they become.
Therefore, the weak classifiers in the early nodes may perform poorly on the last node,
and thus tend to obtain zero coefficients. We call those weak classifiers with non-zero
coefficients "effective" weak classifiers. Table 1 shows the ratios of "effective" weak
classifiers contributed by one node to a specific successive node. To save space, only
the first 15 nodes are shown. We can see that the ratio decreases with the growth
of the node index, which means that the farther the preceding node is from the current
node, the less useful it is for the current node. For example, the first node has almost
no contribution after the eighth node. Table 2 shows the number of effective weak
classifiers used by our algorithm and by traditional stage-wise boosting. All weak
classifiers in stage-wise boosting have non-zero coefficients, while our
totally-corrective algorithm uses far fewer effective weak classifiers.

Table 1: The ratio of weak classifiers selected at the i-th node (column) appearing with
non-zero coefficients in the j-th node (row). The ratios decrease along with the growth
of the node index in each column.

Node     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15
   1     …
   2     …     …
   3     …     …     …
   4  0.86  1.00  0.97  1.00
   5  0.43  0.93  0.97  0.97     …
   6  0.71  0.93  0.90  1.00  0.96  1.00
   7  0.43  0.87  0.87  0.97  0.92  0.92  1.00
   8  0.29  0.40  0.70  0.73  0.74  0.88  0.74  1.00
   9  0.00  0.27  0.50  0.60  0.76  0.72  0.66  0.67  1.00
  10  0.14  0.27  0.43  0.60  0.62  0.70  0.62  0.66  0.60  1.00
  11  0.00  0.20  0.33  0.50  0.52  0.54  0.60  0.59  0.56  0.48  1.00
  12  0.14  0.20  0.40  0.40  0.56  0.50  0.54  0.61  0.55  0.46  0.36  1.00
  13  0.00  0.13  0.33  0.37  0.36  0.54  0.40  0.47  0.47  0.46  0.43  0.25  1.00
  14  0.00  0.07  0.17  0.40  0.28  0.50  0.42  0.49  0.50  0.53  0.45  0.43  0.35  1.00
  15  0.00  0.13  0.20  0.27  0.36  0.38  0.46  0.41  0.52  0.42  0.49  0.44  0.34  0.27  1.00

Table 2: Comparison of the numbers of effective weak classifiers for stage-wise
boosting (SWB) and totally-corrective boosting (TCB). We take AdaBoost and
AsymBoostTC1 as representative types of SWB and TCB, both of which are trained in
the multi-exit cascade for face detection.

Node     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18
SWB      7    22    52    82   132   182   232   332   452   592   752   932  1132  1332  1532  1732  1932  2132
TCB      …    22    52    80   125   174   213   269   331   441   464   538   570   681   717   744   742   879



5 Conclusion 

We have proposed two asymmetric totally-corrective boosting algorithms for object
detection, which are implemented by the column generation technique in convex
optimization. Our algorithms introduce asymmetry into both feature selection and ensemble
classifier learning in a systematic way.

Both our algorithms achieve better results for face detection than AdaBoost and
Viola-Jones' asymmetric AdaBoost. We observe no great difference in performance
between AsymBoostTC1 and AsymBoostTC2 in our experiments. For the face detection
task, AdaBoost already achieves a very promising result, so the improvements of our
methods are not very significant.

One drawback of our algorithms is that there are two parameters to be tuned. For
different nodes, the optimal parameters should differ. In this work, we have used
the same parameters for all nodes. Nevertheless, since the probability of negative
examples decreases with the node index, the degree of asymmetry between positive and
negative examples also decreases. The optimal $k$ may therefore decline with the node
index.

The framework for constructing totally-corrective boosting algorithms is general, 
so we can consider other asymmetric losses (e.g., asymmetric exponential loss) to form 
new asymmetric boosting algorithms. In column generation, there is no restriction that 
only one constraint is added at each iteration. Actually, we can add several violated 
constraints at each iteration, which means that we can produce multiple weak classifiers 
in one round. By doing this, we can speed up the learning process. 
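Adding several constraints per round amounts to keeping the few weak classifiers with the largest edges above $\theta$ instead of only the best one. The sketch below illustrates this selection step on a precomputed output matrix; the function name and the toy data are hypothetical.

```python
def most_violated_columns(H, y, u, theta, n_add):
    """Indices of up to n_add weak classifiers whose edge exceeds theta, largest edge first."""
    n_cols = len(H[0])
    edges = [sum(u_i * y_i * row[j] for u_i, y_i, row in zip(u, y, H))
             for j in range(n_cols)]
    violated = [j for j in range(n_cols) if edges[j] > theta]
    violated.sort(key=lambda j: edges[j], reverse=True)
    return violated[:n_add]


# Three candidate columns with edges 1.0, 0.5 and 0.0 under uniform weights.
H = [[1, 1, 1], [1, -1, -1], [-1, -1, 1], [-1, -1, -1]]
y = [1, 1, -1, -1]
u = [0.25, 0.25, 0.25, 0.25]
print(most_violated_columns(H, y, u, theta=0.2, n_add=2))
```

All returned columns would then be added to the restricted master problem before it is solved again, so fewer rounds of optimization are needed for the same number of weak classifiers.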

Motivated by the analysis of sparseness, we find that the very early nodes contribute 
little information for training the later nodes. Based on this, we can exclude some use- 
less nodes when the node index grows, which will simplify the multi-exit structure and 
shorten the testing time. 
Fig. 2: Testing curves of single-node classifiers for AdaBoost, AsymBoostTC1 and
AsymBoostTC2. All the classifiers use the same training and test data sets. (1) shows
curves of detection rates (DR) with false positive rates (FPR) fixed at 0.25; (2) shows
curves of FPR with DR fixed at 0.995. FPR or DR are evaluated at each weak classifier.








Fig. 3: Performance of cascades evaluated by ROC curves on the MIT+CMU data set.
AdaBoost is referred to as "Ada", and asymmetric AdaBoost [3] is referred to as "Asym".
"Viola-Jones cascade" means the traditional cascade used in [2].



